-
Notifications
You must be signed in to change notification settings - Fork 4
PacBio #143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PacBio #143
Conversation
@@ -207,43 +207,6 @@ def test_get_orig_names_from_sheet_with_replicates(self): | |||
|
|||
self.assertEqual(set(obs), exp) | |||
|
|||
def test_required_file_checks(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test were (1) specific to Illumina so they should not be part of Pipeline, (2) duplication of what normally happens in any code: check if a files exists and raise an error; thus, in my opinion, unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of typos and questions. Happy to look again when tests pass.
self.mandatory_attributes = ['qclient', 'uif_path', | ||
'lane_number', 'config_fp', | ||
'run_identifier', 'output_dir', 'job_id', | ||
'lane_number', 'is_restart'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does lane_number
need to be present twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a mistake; maybe we found the culprit of the duplicated lane?!
src/qp_klp/Protocol.py
Outdated
return failed_samples | ||
|
||
def generate_sequence_counts(self): | ||
# for other isntances of generate_sequence_counts in other objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: replace "isntances" with "instances"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
def generate_bam2fastq_commands(sample_list, run_folder, outdir, threads): | ||
"""Generates the bam2fastq commands""" | ||
df = pd.read_csv(sample_list, sep='\t') | ||
files = {f.split('.')[-2]: f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this split/slice grabbing from the file name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding some comments about this.
# add to bucket_size | ||
bucket_size += r1_size + r2_size | ||
current_size += r1_size | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand we are hurrying, but can we add an issue to refactor this copy/paste addition in the future? There is so much shared functionality between the two branches of the if/else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll create an issue pointing to what needs to be refactored.
|
||
for d in openfps.values(): | ||
for f in d.values(): | ||
f.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The amount of code duplication between this and the original demux
command feels to me like a real invitation to future bugs. Assuming we do not have time to refactor this now, can we have an issue to remind us to do it in the future?
if 'chemistry' in header: | ||
chemistry = header['chemistry'] | ||
else: | ||
chemistry = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest replacing with
chemistry = header.get('chemistry', '')
which will have the same effect, be shorter, and remove the need to repeat 'chemistry'
:)
mtag = tree.getroot().tag.split('}')[0] + '}' | ||
instrument_text = ( | ||
f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run') | ||
run = ET.parse(run_info).find(instrument_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really uncomfy with this being repeated. Could we make a method something like the below
def get_pacbio_run_str(run_info, run_directory):
pacbio_metadata_fps = glob(
f'{run_directory}/*/metadata/*.metadata.xml')
if not pacbio_metadata_fps:
raise ValueError(f"'{run_info}' doesn't exist")
run_info = pacbio_metadata_fps[0]
tree = ET.parse(run_info)
mtag = tree.getroot().tag.split('}')[0] + '}'
instrument_text = (
f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
run = ET.parse(run_info).find(instrument_text)
return run
and then call it in _get_instrument_id
like
if not exists(run_info):
run = get_pacbio_run_str(run_info, run_directory)
text = run.attrib['TimeStampedName']
and in _get_instrument_type
like
if not exists(run_info):
run = get_pacbio_run_str(run_info, run_directory)
date_string = run.attrib['CreatedAt'].split('T')[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you; this made me realize that we could use the same method and replace: get_date_from_run_id.
Description,,,,,,, | ||
,,,,,,, | ||
[Data],,,,,,, | ||
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Sample_Project,Well_description,Lane,barcode_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I don't know if you are round-tripping sample sheets at any point, but if you read this in and then write it back out, barcode_id
will show up between Sample_Well
and Sample_Project
, not at the end, due to the forced ordering of the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know.
tests/test_PacBioWorflow.py
Outdated
makedirs(f'{genprep_dir}/{sp}/filtered_sequences/', | ||
exist_ok=True) | ||
|
||
# # then loop over samples and stage all fastq.gz files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra #
?
demux-runner | ||
echo "$(date) :: demux stop" | ||
|
||
touch ${OUTPUT}/${SLURM_JOB_NAME}.${SLURM_ARRAY_TASK_ID}.completed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like it is almost the same as nuqc_job.sh
, with just a few deletions. Is this something we could consider refactoring in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @AmandaBirmingham for your review. Added this issue: #144.
self.mandatory_attributes = ['qclient', 'uif_path', | ||
'lane_number', 'config_fp', | ||
'run_identifier', 'output_dir', 'job_id', | ||
'lane_number', 'is_restart'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a mistake; maybe we found the culprit of the duplicated lane?!
src/qp_klp/Protocol.py
Outdated
return failed_samples | ||
|
||
def generate_sequence_counts(self): | ||
# for other isntances of generate_sequence_counts in other objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
def generate_bam2fastq_commands(sample_list, run_folder, outdir, threads): | ||
"""Generates the bam2fastq commands""" | ||
df = pd.read_csv(sample_list, sep='\t') | ||
files = {f.split('.')[-2]: f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding some comments about this.
# add to bucket_size | ||
bucket_size += r1_size + r2_size | ||
current_size += r1_size | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll create an issue pointing to what needs to be refactored.
|
||
def run(self, callback=None): | ||
""" | ||
Run BCL2Fastq/BCLConvert conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, updating.
@@ -52,13 +53,15 @@ def __init__(self, fastq_root_dir, output_path, sample_sheet_path, | |||
:param additional_fastq_tags: A list of fastq tags to preserve during | |||
filtering. | |||
:param files_regex: the FILES_REGEX to use for parsing files | |||
:param read_length: the FILES_REGEX to use for parsing files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for catching all these.
Description,,,,,,, | ||
,,,,,,, | ||
[Data],,,,,,, | ||
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Sample_Project,Well_description,Lane,barcode_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know.
mtag = tree.getroot().tag.split('}')[0] + '}' | ||
instrument_text = ( | ||
f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run') | ||
run = ET.parse(run_info).find(instrument_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you; this made me realize that we could use the same method and replace: get_date_from_run_id.
Thank you for all the help @AmandaBirmingham |
No description provided.